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SMALL WORLD MCMC WITH TEMPERING: 
ERGODICITY AND SPECTRAL GAP 

By Yongtao Guan* and Matthew Stephens 

Baylor College of Medicine and University of Chicago 

When sampling a multi-modal distribution ir(x), x g R d , a Markov 
chain with local proposals is often slowly mixing; while a Small- World sam- 
pler (Guan and Krone, 2007) - a Markov chain that uses a mixture of local 
and long-range proposals - is fast mixing. However, a Small- World sampler 
suffers from the curse of dimensionality because its spectral gap depends on 
the volume of each mode. We present a new sampler that combines temper- 
ing, Small- World sampling, and producing long-range proposals from sam- 
ples in companion chains (e.g. Equi-Energy sampler). In its simplest form 
the sampler employs two Small-World chains: an exploring chain and a sam- 
pling chain. The exploring chain samples 714(2;) oc n(x) 1 ^, t £ [l,oo), 
and builds up an empirical distribution. Using this empirical distribution as 
its long-range proposal, the sampling chain is designed to have a stationary 
distribution n(x). We prove ergodicity of the algorithm and study its conver- 
gence rate. We show that the spectral gap of the exploring chain is enlarged 
by a factor of t d and that of the sampling chain is shrunk by a factor of t~ d . 
Importantly, the spectral gap of the exploring chain depends on the "size" of 
7r t (x) while that of sampling chain does not. Overall, the sampler enlarges a 
severe bottleneck at the cost of shrinking a mild one, hence achieves faster 
mixing. The penalty on the spectral gap of the sampling chain can be sig- 
nificantly alleviated when extending the algorithm to multiple chains whose 
temperatures {tk} follow a geometric progression. If we allow th — > 0, the 
sampler becomes a global optimizer. 

1. Introduction. Developing an algorithm to improve sampling efficiency for a high dimen- 
sional multi-modal distribution has been an active research area (Geyer, 1991; Marinari and Parisi, 
1992; Neal, 2001; Kou et al, 2006; Madras and Zheng, 2003; Guan and Krone, 2007; Andrieu 
et al, 2008; Woodard et al, 2008; Brockwell et al, 2010; Del Moral and Doucet, 2010). In this pa- 
per we introduce a new sampling algorithm and study its ergodicity and convergence properties. The 
new algorithm combines several existing ideas: tempering (Geyer, 1991; Marinari and Parisi, 1992), 
propagating information between chains via empirical distributions (Kou et al, 2006), and Small- 
World sampling (Guan et al, 2006; Guan and Krone, 2007). A Small- World sampling combines 
local and long-range proposals in a Metropolis-Hastings algorithm. Guan and Krone (2007) showed 
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that a Small-World chain converges faster than a Metropolis-Hastings chain with local proposals 
when sampling a multi-modal density. 

The new algorithm is simple and easy to implement. In its simplest form the algorithm has two 
Small-World chains: the first chain samples a fattened (or flattened) distribution (via tempering) with 
its long-range proposals generated by a typical heavy-tailed distribution; the second chain samples 
the target distribution with its long-range proposals randomly drawn from samples of the first chain. 
Intuitively, the first chain identifies and remembers (via its empirical distribution) modes of the 
distribution - it does so more effectively than an unheated chain because it accepts more long-range 
proposals that tend to move between different modes. This knowledge of whereabouts of different 
modes is used by the second chain through its long-range proposals. 

The new algorithm bears similarity with the Equi-Energy sampler. Indeed, our original goal was 
to prove the ergodicity of the Equi-Energy sampler - the ergodicity proof in the Equi-Energy paper 
was incomplete and the amended proof Atchade and Liu (2006) has difficulties. Identified these 
problems, Andrieu et al. (2008) proved ergodicity of the Equi-Energy sampler using the Poisson's 
Equation. By studying properties of perturbed kernels, Fort et al. (2010) proved ergodicity of a 
class of adaptive MCMC algorithms, including the Equi-Energy sampler. Our proof uses mixingale 
theory, built on the work of Atchade and Liu (2006) and Atchade and Rosenthal (2005), which 
allows us to study the asymptotic convergence rate. The convergence results identified that local 
proposal in the feeding chain of Equi-Energy sampler slows down the convergence in a multi-modal 
distribution. Thus we propose using a small-world sampler as the feeding chain. 

The paper contributes both practically and theoretically in sampling high dimensional multi- 
modal distributions. First, we provide a simple and easy-to-implement algorithm that is effective 
for challenging problems. Second, we prove ergodicity of the algorithm and analyze its conver- 
gence rate. The result on convergence rate provides important insights into when, why and how the 
algorithm will improve convergence in practice. 

We first formally set up the problem. Let £>(M d )) be the state space equipped with its a- 
algebra. A Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970) aims to sample 
from a probability measure ir that admits a density with respect to Lebesgue measure that is only 
known up to a normalizing constant. We use n to denote a measure and tt(x) to denote a density 
and it should be clear from the context. The transition kernel of a Metropolis-Hastings chain is 

(1) P(x, dy) = a(x, y) k(x, dy) + r(x) 5 x (dy), 



w(y) k(y,x) 



where k(x,y) : W 1 x R d — > [0,oo) is a proposal distribution, a(x,y) = min fl, k ^ x ^ 

is the acceptance probability of a proposed move y, the 6 X is the point mass at x, and r(x) = 
J Rd (1 — a(x,y))k(x,dy) is the probability that y being rejected. Obviously k(x,y) determines 
P(x,dy). 

Let / : M. d — > E denotes a measurable function. For a signed measure p and a positive function 
V, define the F-norm of p. by \\p\\v '■= supj<y \p{f)\ where p(f) = J Rd f(x)p(dx). A transition 
kernel P acts on / such that Pf(x) = f Rd f(y)P(x, dy). Given two transition kernels P, Q, the 
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product PQ is also a transition kernel (PQ) (x, A) = f Rd P(x, dy)Q(y, A). This allows us to define 
a product of kernels of countable many through induction. 

1.1. Minorization and drift conditions. We assume the following assumptions hold: 

Al A probability measure ip exists on B such that the Markov chain is -^-irreducible and aperiodic 

(c.f. Meyn and Tweedie, 1993). 
A2 Minorization Condition: there exist a set C 6 6 and e > such that, for the same tp in Al, 

we have ip(C) > and, for all A G B, x G C, 

(2) P(x, A) > e ip(A). 

A3 Drift Condition: there exist a measurable function V : M. d — > [1, oo) such that n(V) < oo 
and constants A G [0, 1) and b G (0, oo) and same set C in A2 satisfying 

(3) PV(x) < XV(x) + bl c (x). 

The minorization condition ensures the flow from the set C to outside is lower-bounded, an idea 
intimately connected to conductance (see Section 4). The drift condition guarantees that the Markov 
chain evolves towards C. The drift condition is necessary to define y -geometrically ergodic. The 
above minorization and drift conditions can be checked for many practical problems. For example, 
if P is a Random Walk Metropolis kernel, both (2) and (3) are known to hold under some regularity 
conditions on the target densities (see Atchade, 2010). 

1.2. Spectral gap. A homogeous Markov chain that satisfies (A1-A3) is geometric ergodic (c.f. 
Meyn and Tweedie, 1993; Roberts and Rosenthal, 2004). Let L 2 (tt) denote the space of measurable 
functions on ~R d with f Rd f(x) 2 ir(dx) < oo, with inner product (/, g) = f Rd f(x)g(x)i:(dx), and 
norm ||/|| = (/, f) 1 ^ 2 . The operator P being reversible with respect to tt is equivalent to P being 
self-adjoint. It is well known that the spectrum of P is a subset of [—1, 1]. (Self-adjoint implies its 
spectrum is real, and being a Markov transition kernel determines the range.) A chain is said to be 
L 2 {tt) -geometrically ergodic if there exist a constant p G (0, 1) and a positive M < oo, and V{x) 
defined in (A3) such that 

(4) \\P n (x,-)-TT(-)\\ v <Mp n V(x). 

Define L^(vr) = {/ G L 2 (tt) : (f, 1) =0}. Denote by P the restriction of P to Lg(vr). It has 
been shown (Roberts and Rosenthal, 1997; Roberts and Tweedie, 2001) that for reversible Markov 
chains, geometric ergodicity is equivalent to the condition 

(5) lll^o|||= sup ||P /||<1. 

/e£§M,||/||<i 
The spectral gap of the chain P is defined by 

(6) Gap(P) = l-|||P |||. 

The spectral gap determines the convergence speed of a MCMC algorithm. Very roughly, a chain is 
close to equilibrium after a few multiples of l/Gap(P) iterations (Madras and Randall, 2002). 
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1.3. Piecewise log-concave distributions. We assume the target density ir(x) is a mixture of 
log-concave densities. A function / : R — > (0, oo) is log-concave if for any s € [0, 1], 

(V) f( S x + (l-s)y)>f(xrf(y) 1 - s . 

Let | • | be a metric on M d , / is a-smooth if | log f(x) — log f(y)\ < a \x — y\ for all x,y £ M d . 
Let {Ai, . . . ,A m } be a partition of M d . Let tvm : Ai — > (0,oo) be an a-smooth log-concave 
density with barycenter /3, = J A ir^(dx) and theirs? moment Mi = |a: — f3i\it^(dx). Let 
/% = | A — /3j|, i 7^ j and = max(/3jj). Let wi > 0, the distribution of interest is 

m 

(8) 7r(x) OC ^7T W (x)1a,(x) Wj. 

i=l 

Define average proposal distance of k(x,y) as D = / K d x jjd |^ — y\k(x,y)dxdy. Then k(x,y) is 
/oca/ if D < min{Mj} and long-range if D > ,5^. We call a chain local chain if it only uses local 
proposals. By size (or complexity) of 7r(x) we mean the quantities associated with ir(x) that will 
affect the spectral gap. Namely, the measure of the steepness of each mode a and the measure of 
the distances between modes f3ij. We treat dimensionality of M. d and number of modes m as fixed 
quantities. 

1.4. State decomposition theorem. We describe the "pieces" of a Metropolis-Hastings chain P 
by defining, for each i = 1, . . . , m, a new Markov chain on Ai that rejects any transitions of P out 
of Ai. The transition kernel Pa z of the new chain is given by 

(9) P At (x,B) = P{x,B) + l B (x)P(x,A C i) for x £ Ai,B C Ai. 

The movement of the original chain among the "pieces" can be modeled by a "component" Markov 
chain with state space {1, . . . , m} and transition probabilities: 

(10) P ^i) = V-TT\l PfaAjWdx), fori^j, 

2 7T(AJ J Ai 

and P c (i, i) = 1 - P c (i,j). 

THEOREM 1.1 (State Decomposition Theorem). In the preceding framework, as given by Equa- 
tions (9) and (10), we have 

(11) Gap(P) > ^Gap(P c )(min i=1) ... im Gap(P Ai )). 

Guan and Krone (2007) proved this state decomposition theorem, generalized it from its original 
version (Madras and Randall, 2002). The theorem says that we can bound the spectral gap by taking 
into account of the mixing speed within each mode and that of among different modes. In Section 
4, we will focus on Gap(P c ) because it causes slowly mixing. 
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1.5. Tempering and Equi-Energy sampler. Since Geyer (1991) and Marinari and Parisi (1992), 
tempering has become a popular technique. Here we use two chains to illustrate tempering, although 
it usually employs multiple chains (Geyer, 1991). 

ALGORITHM 1.1. Let s 6 (0, 1). Let x c and Xh be current states of the cold and hot chains that 
sample ir(x) and TTt(x) oc vr(x) 1 /* (for some t > 1) respectively. Repeat the following steps. 

1. Simulate u ~ Unif [0, 1). 

2. Ifu<s, update x c and Xh independently using the Metropolis-Hastings algorithm. 

3. If u> s, compute a = 7rfe) an< ^ swa P Xc an d x h with probability min(l, a). 

The hot chain samples a flattened distribution; the flattening makes it easier for a local chain 
to move around to discover new modes. This easiness in the hot chain transfers to the cold chain 
through successful "swap". However, tempering has at least two limitations. First, because the pro- 
cess is "memory-less", modes need to be repeatedly rediscovered. Second, the cold chain interferes 
with the hot chain because of the "swap", which may slow down mixing. The Equi-Energy sampler 
(Kou et ai, 2006) resolves both limitations using empirical distributions. The Equi-Energy sampler 
runs multiple chains of different temperatures and samples of each chain are recorded and classified 
into different energy rings according to their energy level (log density). In addition to local propos- 
als, the Equi-Energy sampler has "equi-energy jumps", when a lower temperature chain proposes a 
new move by randomly drawing a sample from an energy ring in a higher temperature chain. Thus, 
the relative easiness of moving between different modes in hot chains propagates down to the cold 
chain. Note a high-temperature chain affects the proposal of a lower-temperature chain; while a 
lower-temperature chain does not affect a high-temperature chain. 

The rest of the paper is organized as following. In the next section, we describe the algorithms 
and main results and compare the algorithm with the Equi-Energy sampler. In Section 3 we prove er- 
godic theorems and discuss the convergence rate. In Section 4 we prove theorems regarding spectral 
gaps of the algorithm. In Section 5 we discuss practical issues and applications. A short discussion 
will conclude the paper. 

2. Algorithms and Main Results. The new algorithm combines a Small-World sampler with 
tempering and we call it "Small-world Tempering with Empirical Ensemble Propagation", or STEEP. 
It has a backronym "Small-world Tempering with Equi-Energy Program" to credit the Equi-Energy 
sampler. We begin with the Small-World sampler. 

ALGORITHM 2.1 (Small World Sampler). Let s G (0, 1) and the current state is X n = x. Let 
l(x, y) and h(x, y) be local and long-range proposals respectively. 

1. With probability (1 — s) simulate yfrom density l(x, y) and compute a = ^4 [fe 2 ^ - 

2. Otherwise, simulate yfrom density h(x,y) and compute a = h(xy) ' 

3. Set X n+ \ <— y with probability min(l, a), otherwise set X n+ \ x. 
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4. Set n 4— n + 1, goto step 1. 

When sampling a multi-modal density, both simulation studies (Guan et al., 2006) and theoretical 
work (Guan and Krone, 2007) have shown that a Small- World sampler converges faster than a local 
chain. Intuitively, the local proposals allows the chain to better explore a mode, while the long-range 
proposals allows it to jump between modes more easily. However, if modes are steep and far apart, a 
Small-World sampler will fail because that successful transitions between modes (initiated by long- 
range proposals) depend on their volumes (Guan and Krone, 2007) - to increase volumes of modes, 
tempering does the trick. 

2.1. The STEEP Algorithm. The simplest form of STEEP employs two chains that run simul- 
taneously. Both are Small- World chains because their proposal distributions are mixtures of local 
and long-range. The exploring chain samples a fattened target 7r t (x) oc ir(x) 1 ^ t for some t > 1, 
where the long-range proposals are drawn from a typical heavy-tailed distribution (e. g., Cauchy). 
Samples collected in the exploring chain induce an empirical distribution £. The sampling chain 
samples ir(x), using £ to simulate its long-range proposals. Because the similarity between irt and 
7T, in particular, their modes are in the same places, the empirical distribution provides natural and 
"intelligent" long-range proposals for the sampling chain. In detail: 

ALGORITHM 2.2 (STEEP with Two Chains). Let ir t (x) oc ir(x) 1 / t for some t > 1. {Y n } and 
{X n } sample irt and n respectively, and Y n = y n and X n = x n . 

1. Simulate Y n+ i\Y n = y n using Algorithm 2.1 to obtain y n +\. Update empirical measure 
Cn+i = n/(n + l)£ n + l/(n + l)6(y n+ i). 

2. Simulate X n+ i\X n = x n using Algorithm 2.1. Modify step (2) so that y ~ £ n +i and compute 
a = ^(y) • ^ et X n +i V with probability min(l, a), otherwise set X n+ \ <— x n . 

3. Set n <— n + 1, goto step 1. 

The algorithm 2.2 is similar to the algorithm in section 3.3 of Andrieu et al. (2007). The differ- 
ence here is that we emphasize the long-range proposals. 

REMARK 1 (Ergodicity). The exploring chain is a homogenous Markov chain with station- 
ary distribution TTf Conditional on the exploring chain, the sampling chain is a non-homogeneous 
Markov chain, whose transition kernel evolves over time because it depends on £ n . Since £ n — > lit, 
the sampling chain converges to a Small-World chain with long-range proposal irt and stationary 
distribution it. As a result, if we run Algorithm 2.2 long enough the sampling chain should generate 
samples approximately from ir. This intuition is formalized in the ergodic theorem in Section 3. 

Since the exploring chain samples a fattened distribution irt, each mode is larger and hence more 
easily found by long-range proposals. This implies a better mixing compared to directly sampling 
7r. We quantify the intuition in the following theorem. 
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THEOREM 2.1. Denote E the exploring chain in Algorithm 2.2 with stationary distribution 
■Kt, and denote E c the component chain of E (see Section 1.4), then Gap(£' c ) > c\ t d for some 
constant c\ that is independent oft and d. 

However, a large t increases dissimilarity between 7r t and ir. Large dissimilarity reduces accep- 
tance ratio of long-range proposals in the sampling chain - causing slow mixing. We quantify this 
intuition in the following theorem. 

THEOREM 2.2. Denote S an (idealized) sampling chain in Algorithm 2.2, using irt(x) as its 
long-range proposal instead of £, and denote S c the component chain of S (see Section 1.4), then 
C2t~ 2d < Gap(»5 c ) < C3 t~ d for some constants C2, C3. In addition, Gap(S' c ) is independent of the 
size of T[(x). 

REMARK 2 (Equilibrium Assumption). Note that Theorem 2.2 considers a Small-World chain 
that uses nt as its long-range proposal. The connection with Algorithm 2.2 is that, as the number 
of iterations increases, the long-range proposal used by the Algorithm becomes an increasingly 
close approximation to nt (indeed £ n — > lit weakly). Thus, intuitively, the result in the theorem 
represents the "asymptotic" behavior of the algorithm. 

Obviously, for Algorithm 2.2 to converge quickly, both the exploring chain E and the sampling 
chain S must converge quickly. Therefore, we want to increase g = min(Gap(i? c ), Gap(i? c )). 
Suppose we set t = 1 in Algorithm 2.2, then the (idealized) sampling chain converges instanta- 
neously (because it proposes from the target distribution). We have g = G&p(E c ), which is spectral 
gap of a Small-World chain. When we increase t in Algorithm 2.2, we enlarge a small spectral gap, 
Gap(i? c ) , and pay a penalty by shrinking a large spectral gap, Gap(5 c ), to achieve faster mixing. 
This trade-off works well because G&p(E c ) depends on target density and can be extremely small; 
while Gap(£ c ) is independent of the target density. However, when d is large, even for a modest 
t, the penalty on Gap(5 c ) can be large. This lead us to extend the algorithm to multiple chains, 
instead of just two. 

ALGORITHM 2.3 (STEEP). Let {U} be a sequence of positive numbers with 1 = to < t\ < 
• • • < ttf. Let {Xn^} sample 7Tj(x) oc ^(x) 1 /**. Let £^ be the empirical measures by the i-th chain. 
We call the H-th chain exploring chain, the rest sampling chains. 

(H) it ( H) ( H) 

1. Simulate X n+l \X^ = x\ with Algorithm 2.1 to obtain x n+l . Update empirical measure 

C\ = »/(« + + v(n + m*£b- 

2. For each i = H - 1, H - 2, . . . , 1, 0: 

(i) (i) (i) (i) 

• Simulate = x n using Algorithm 2.1 to obtain x n+1 . Modify step (2) so that 

^ and commute n - 7ri(y) 7r '+ l(x " )) 

V ~ Wi compute a - n i+1 ( y ) ■ 

• Set X^+i ^— y with probability min(l, a), otherwise X^ x <— if, . 
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• Update empirical measure q£Li = n/(n + l)£n + l/(n + 
3. Se? n -<— n + 1, goto step 1. 

We may determine optimal temperatures {ij} in Algorithm 2.3. Assuming tu fixed and all chains 
reach equilibrium, denote Gi a Markov chain that samples TTi(x) with long-range proposal Wi+x{x), 
and let g^, be the spectral gap of the component chain of Gi, for i = 0, ■ ■ ■ , H — 1. Theorems 2.2 
implies that 

(12) 50 = c (tiAor d ,5i = ci(t 2 /ti)" d ,...,5H-i = c H -i(t H /t H - 1 )- d , 
for some constants co, . . . , ch-%. We want to choose {ij} to optimize 

(13) ^ = min(5i, . . . ,g H -i)- 

Assume c\ = ■ ■ ■ = ch in Equations (12), then for a fixed tjj, we have go = • • • = Qh-\ 
maximizes g defined in (13). This implies that {ti} is a geometric progression because of (12) and 
assumption on Cj's. Such a geometric progression on adjacent temperatures has been suggested in 
both parallel tempering (Predescu et al, 2004) and the Equi-Energy sampler (Kou et al, 2006). The 
former is based on the optimality of the acceptance ratio of swaps between adjacent chains and the 
later is based on empirical evidence. Our argument here provides an additional justification based 
on spectral gap. 

The exploring chain in Algorithm 2.3 can benefit from a large tjj with a modest r = ifc+i/ift and 
a modest H thanks to the geometric progression. A large tjj makes the "exploring" easier, while the 
penalties spread out across sampling chains. We will discuss how to choose tjj and r in practice in 
Section 5. 

2.2. Finding Global Optimum. We may extend the STEEP to find the global optimum of a 
density tt{x) - tempering does the trick. Specifically, in Algorithm 2.3, instead of stop at t = 1, we 
continue the process at t = r~ k for k = 1, • • ■ , C. As k increases, each mode becomes steeper and 
the probability concentrates around the global maximum. The samples collected in the last chain 
will be very close to the global optimum. We have the following Lemma. 

LEMMA 2.3. Let tf. = T k for k = H, H — 1, • • • , 1, 0, —1, ... — C, then for a sufficiently large 
C, the samples collected by the last chain in Algorithm 2.3 will be arbitrarily close to the global 
maximum. 

PROOF. Let ir(x) be an a-smooth log-concave density that attains its unique maximum at 0. 
Assume irt(x) oc tt(x) 1 ^. It is sufficient to show that for any e > there exists 5 > 0, such 
that for any < t < 5 we have f B 7Tt(x) > 1 — e, where B t is a ball centered at with radii e. 
However, the claim hold trivially because 7r t (x) decays exponentially at rate a/ 5 which can be made 
arbitrarily large by letting 5 — > (or equivalently increasing C). Extension to mixture of a-smooth 
log-concave densities is trivial if the mixture has a unique global optimum. If the mixture density 
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has multiple global optima, then each global optimum Xj has a ball Bj >e centered around it and the 
samples collected will concentrate in Bj t6 . And each global optimum can be found. □ 

2.3. Connection with the Equi-Energy sampler. Although our algorithm shares several similar- 
ities with (and benefits of) the Equi-Energy sampler, there are also several differences. First, our 
algorithm is conceptually simpler, effectively because it dispenses with the energy ring and energy 
cut-off of the Equi-Energy sampler. This makes it easier to understand, and perhaps also slightly 
easier to implement, which we believe could facilitate its more widespread use (despite its appeals, 
the Equi-Energy sampler appears not to have been used very extensively in practical applications). 
Second, from a more technical perspective, our algorithm makes use of a small-world (fast converg- 
ing) chain as the exploring chain. This has two advantages. First, Theorem 2.1 gives us insights on 
why the tempering helps, while such a theorem does not hold for a local chain used in the Equi- 
Energy sampler. Second, it has been argued (Jarner and Roberts, 2007) that, if the target distribution 
7T is heavy tailed, the proposal distributions should have heavy tails as well. When combining high 
temperature with energy cutoff, the target distribution is likely to behave like a heavy tailed dis- 
tribution, which will be a challenge for a local chain, while not much so for a Small-World chain. 
In addition, STEEP provides a different perspective in that every chain in STEEP is a Small-World 
chain, and a hot chain provides informed long-range proposals for the adjacent cold chain. However, 
in practice, at least for the simple examples given in the end of the paper, we would not expect much 
difference in performance between STEEP and the Equi-Energy sampler. 

3. Ergodicity. We devote this section to state and prove ergodicity, Theorem 3.5, for Algorithm 
2.3 (multiple chains). We first extend a key result of (Atchade and Rosenthal, 2005, Lemma 3.1) 
to two chains (Algorithm 2.2) to obtain Theorem 3.2. Then we extend a theorem in (Atchade and 
Liu, 2006, Theorem 3.3) to ^-geometric ergodic to obtain Theorem 3.4. Our proof follow closely to 
theirs. There are two technical lemmas: Lemma 3.1 contributes to prove Theorem 3.2; and Lemma 
3.3 helps Theorem 3.2 to prove Theorem 3.4. We then state and prove Theorem 3.5. We conclude 
this section by discussing convergence rate of Algorithms 2.2 and 2.3. 

In Algorithm 2.2 {Y n } is the exploring chain and {X n } is the sampling chain. Let Xq = xq, 
Yq = 2/0- Let {E n } denote a sequence of operators that generates {Y n }, where E\ may be identical. 
Denote E\. n = E\ ■ ■ ■ E n . Recall £ n is an empirical measure generated by process {Y n }. Let {P^ n } 
denote the sequence of operators that generates {X n }, where 

(14) P£ n (x,A) = (1 - s)T(x,A) + sK in (x,A), 
where T is a local chain with stationary distribution it and 

-i n \ n 

(15) K^f(x) = - V a(x, yj )f( yj ) + -/(ar) V (1 - a(x, Vj )), 

n n 
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where y/s are samples that induce empirical measure £ n , and a[x, y) = min ^1, ^jfj) ■ ^ or a 
sequence of operators {P n }, denote P^j = PjPj+i ■ ■ ■ Pj for i < j, Pj.j = Pj, and Pj.j = / if i > j. 
Let = ct(Xl, . . . , X n ) and = cr(Y\, . . . , 1^) be the filtrations of process {!„} and {Y n } 
respectively, and let T n = cr{X\, . . . , X n , Yi, . . . , Y n ). 
The following Lemma is needed to prove Theorem 3.2. 

Lemma 3.1. Define a n := ||£ n — £ n -i||v> an d assume there exist constants M < oo and 
p 6 (0, 1), such that for each x £ X ||P| (x, •) — 7T£ n (-)||y < Mp 3 V(x) P yo -a.s., then 

1. a n is J 7 ^ measurable and¥, yo (a n ) < 0(l/n). 

2. \\P^,-)-P^ 1 ^r)\\v<2sV(x)-a n . 

3. ||vr^ n — 7r^ n _ 1 ||y < M a n for some constant M . 

PROOF. a n = SU V]f]<v \Uf) - Cn-l(f)\ < k\ V ( Y n) + ££"l V ( Y i)\ and note that 
E yo (V{Y n )) < oo for all n and that E yo (V(Y n )) -> E(V) < oo as n ^ oo, and claim (1) fol- 
lows. Let | /| < V, then 

l^nfo/)--P*n-l(*>/)l 

a{x,y)f{y)[i n - £ n -i](dy) + f[x) j [1 - a(x,y)][^ n - £ n _i](dy) 

[/(l/) + /(*)]& 



(16) 



< s 



< 2 sV(x) 



v(i/)Kn-e«-i](dy) 

<2 S ^(x)||£„-£ n _i||y, 
and the conclusion (2) follows. From Lemma B.l. of Andrieu et al. (2007), we have 

Kit - *-,)<*. /)l * « ■» 

for all k. Let A; — > oo and combine with (2) to get desired result (3). 

THEOREM 3.2. In the proceeding framework, and assume 

1. For each x £ X, \\Ei :n (x, •) — vrt(-)||\/ — > 0, as n — > oo. 

2. There exist constants M < oo a«<i p G (0,1), smc/z that for each x £ X ||P| (x, •) 

7Tf n (.)llv<^V(x)P M - a .5. 



□ 



addition, for finite constant c\ , c% define 
(17) 
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Define g k ^ k = f - 7T^(/). Then for any \f\ < V, we have 

(18) \^(x ,y )(gn+j,t n+3 (X n+J )\T n )\ < B(k U k 2 ,j)V(X n 

P xo ,yo a - s - an d as an immediate consequence, 

(19) \E (xQiyo) (f(X n ) -7T£„(/)) I < B(k 1 ,k2,n)V(x ). 
In addition, 



(20) 



1 - 

(/(**) -7r ei (/))->-0 asn^oo,P a 



•'-'() •.Vo 



a.s. 



Proof of Theorem 3.2. For notational convenience, we denote P^ n by P n in the following 
equation. 



^x {gn^ n { X n+j)\^n) = P n :(n+j-l)f(X n )-TT£ n (f) 

i-i 

<|^(P n fe -^J(P ri+fe -P ri )P (ri+fe+1):(n+i _ 1) /(X n ) + 7Tf„(/) 
fc=i 
, i- 1 



< | ^ M^P* - 7r Cn )(P n+fe - P n )F(X n ) + M 2 ^F(J!r„ 

fc=i 
i-i 

< M 3 ka n (P* - 7r in )V(X n ) + M 2 ^V(X n ) 

i-l 

< M 4 a n F(X n ) ^ /A; + M 2 ^V(X n ) 



fe=i 



< [M 4 



a,. 



(1 " P) 2 



M 2 p^(X n ) 



where a n is defined in Lemma 3.1 (1). The transition between the first and the second line comes 
from the well known identity P% :n = Ylk=i ^li^k+i — Pl)P(k+2):n+Pii i n the transition between 
the second and the third line, we recursively apply drift condition (A3); and in the transition between 
the third and the fourth line we applied Lemma 3.1 (2) with telescoping sum. Now taking into 
account of Lemma 3. 1 (3) and again with telescoping sum, we have 



(21) 



^o(9n +j ,t n+j (X n+j )\T n ) < [M 4 —^ + M 2 p>]V(X n )+jM, 
< [M 5 a n j + M 2 p>]V(X n ), 
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where M5 = 2 max(M4/ (1 — p) 2 , M). Using the filtration trick in the end of Lemma 3.1 of Atchade 
and Rosenthal (2005), we get for k = 1, . . . , j 



E XQ (9n+j, £ n+j (X n +j) \F re) — E XQ \E XQ \9n+j, ?„ +j {Xn+j)\F n +j-k) \F n 

< E xo [ E xo (g n +j, e, n+j {X n +j)\F n+j-k^j \F n ] 

< mm 1 < fc < i [M 5 Q! w+i _ fc k + M 2 p k ]E X0 {V(X n + j - k)\F n ) 

< nuni< fc <j[M 5 a< n+i _ fc k + M 2 p k ]V{X n ). 
Taking expectation with respect to process {Y n }, we get 

\^x , yo (g n +j, £ n+j (X n +j)\J r n)\ < ^{min^k^ j[M 5 a n+j _ k k + M 2 p j ]}V(X n ) 

(22) k 

< mm 1<k<j [M 6 —. + M 2 p>]V(X n ), 

~ n + j — k 

which is claim (18). Taking n = in (22), we obtain (19). 

Note B(a,c 2 ,n) — > as n — > 00 (see Section 3.1), this and (22) show that 

(23) E Xo , yo (f(X n ) - vr 5n (/)) -> 0, as n -> 00. 

Following argument similar to the proof of Theorem 3.2 of Atchade and Rosenthal (2005), write 
Zn = g n £ n {Xn) - E Xo , yo (g n £ n (X n )). Given (22), (Z n ,T n ) is an L 2 -mixingale with mixingale 
sequence c n being a constant and ip n = B{c\, c 2 , n). Then Corollary 2.1 of Davidson and de Jong 
(1997) implies that 

1 n 

(24) -Y,(gk£ k (X k )-E 

%o,yogk,£,k(Xk)) ~^ 0, Px ,y Q a - s - as n — )• 00. 

n k=l 

Combine (23) and (24) we obtain (20). □ 

The following lemma is needed to prove the Theorem 3.4. It is a technical lemma that is an 
extension of of Lemma 3.1 in Atchade and Liu (2006). 

LEMMA 3.3. Let {f n } be a sequence of measurable functions and let {p n } be a sequence of 
probability measures such that \f n \ < V and f n —*f pointwise and p n — > p setwise. In addition, 
fV(x)p n (dx) < 00 and JV(x)p(dx) < 00. Then j f n (x)p n (dx) — > j f(x)p(dx). 

Proof of Lemma 3.3. In light of Proposition 18 in Chapter 11 of Royden (1988), it is suf- 
ficient to prove that J V(x)p n (dx) — > J V(x)p(dx). However, setwise convergence of p, n — > p, 
implies that for any simple function tp, we have J i^p n (dx) — > f tpp(dx). Since simple functions 
are dense (in LP), we have two sequences of simple functions 2V > t/j m > V and (j> m < V such 
that ip m \ V and cf> m /• V, and f ip m p n {dx) > f V p n (dx) > J 4> m p n (dx). Let n — > 00, we have 
/ ipmfJ'(dx) > lim n J Vp n (dx) > J 4> m p(dx) for any m. Now let m — > 00 and apply Lebesgue's 
dominant convergence theorem to finish the proof. □ 
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THEOREM 3.4. With the setting outlined as in Theorem 3.2, assume: 

1. For any x G R rf and A G B, P^ n (x, A) — > Pg(x, A) as n — )• oo, P yo -a.s. 

2. There exists a finite constant M and a p G (0, 1) such that \ \P]?(x, •) — vr(-)| |y < MV{x)p k 
and\\Pl(x, •) - 7r^(-)||y < M^a;)/ P yo -a.s. 

Then for any measurable function f : X — > 1Z such that \ f\ < V we have 



1 ™ 

(25) E, ,, [/(X„)]^7r(/) and f(X h ) -+ tt(/) P 



fc=i 



a.s. 



Proof of Theorem 3.4. Forany|/| < V, by Theorem 3.2 we have E xom [f(X n )-TT^ n (f)} ->■ 
and - $^ =1 [/(^fc) — ^ fe (/)] ~~ >■ -fa;o,i/o" a - s - as ra — ^ oo. To finish we need to prove that 
^Znif) ~^ n (f) -fyo" a - s - as ra — >• oo. By assumption we have P^ n f(x) — > P^f(x) P yo -a.s. for all 
x G R d . By Lemma 3.3, Pj? n f(x) = Pg n (P^/(x)) — > P|/(x). By recursion, for any k > 1, 

(26) P£/(x) ^ P*/(x) P w -a.s. as n -> oo. 

We have 

- < |vr 5n (/) - P£/(x)| + |P|/(x) - tt(/)| + |P£/(x) - P|/(x)| 

<2MV>)/ + |P|j(x)-P*.f(x)|. 

Combine above with (26) we have |7T£ n (/) — 7r(/)| — >• P yo -a.s. □ 

Recall in Algorithm 2.3, the PT-th chain is a homogeneous chain. Conditioning on the realization 
of {Xn +1 ^} (for i < H), {Xn } is a nonhomogeneous Markov chain with transition kernel P„ . 
The P® operates on / such that: P$f(x) = (1 - s)T^f(x) + siT { 2 +1) /(x), where 1^) is a 
homogeneous local chain with stationary distribution 7T£, and 

i n 1 n 

(28) i^U/M = " E a ( x < + E t 1 - a (*> 

i=i i=i 

where yj's are samples that induce empirical measure £,n +1 \ and a(x, y) = min ^1, 

Let (x, A) = (1 - s)rW/(x) + si£^] +1) /(x) be the limiting transition kernel as n — > oo where 



K J+i)f(x)= a{x,y)f(y)ir i+ i{dy) + f(x) / (1 - a(x, y)) 7r i+ i(rfy). 
We have the following ergodic theorem. 

imsart-aap ver. 2011/11/15 file: steep_aap.tex date: November 21, 2012 



14 



GUAN AND STEPHENS 



THEOREM 3.5. In proceeding framework, assume T^\i £ {0, . . . , H}, satisfies assumptions 
(A1-A3), then for \ f\ < V, as n — > oo, 

1 n 

(29) E[/(XW)] -> n (f) and -^W)"^^/) 

fe=i 

Proof of Theorem 3.5. Let be the starting point of process {X^ }. It is easy to check 
the detailed blance equation holds for the process {Xn }, so the claim holds for chain H. 

For each % < H, we want to show that the i-th chain in STEEP algorithm satisfies two assump- 
tions in Theorem 3.4. 

Since process {Xn +1 ^} has stationary distribution 7Tj + i, we have for all x G R d , P,i l+1 ^(x, •) — > 
7Tj + i . Since a(x,y) as in (28) is bounded, by Lebesgue's dominate convergence theorem, K^ i+1) f(x) — > 

K^ +1) f(x), P x ( i+ i)-a.s., which implies P»/(z) -> (1 - s)T«/(x) + sK^ +1) f(x) P^+D-a.8., 

and hence Pn\x, A) -)■ P^(x, A) P ( i+ i)-a.s. for each x <ER d and ^4 G B. So the Assumption (1) 
hold true. 

The drift and minorization conditions on rW transfers to P^, so that each P$ admits an invari- 
ant distribution 7Tj in P ( i+ i)-a.s. and is geometric ergodic. The limiting transition kernel pW also 

' x o 

inherit the drift and minorization conditions on TW and admits invariant distribution 7Tj P (i+ i)-a.s. 

x o 

So the Assumption (2) hold true. 

By induction, Theorem 3.5 now follows Theorem 3.4. □ 

3.1. Convergence Rate. The convergence of Algorithm 2.2 was broken down into two parts. 
The first part is (P^ • • • P^ n )(x, •) — > 7r^ n (Theorem 3.2), which is the convergence of the sampling 
chain; and the second part is 7T£ n — > 7T* (Theorem 3.4), which is the convergence of the exploring 
chain. We shall discuss them separately. 

First note the exploring chain in Theorem 3.2 is more general in that it can be either a homogenous 
or a non-homogenous chain. From (27) and Lemma 3. 1 

\HM) - <f)\ <2MV(x)p k + |P*/(x) - Pj?f(x)\ 

(30) <2MV{x)p k + 2sV{x)\\i n -i\\ v 

<(ci P k + C 2 Un-av)V(x), 

where for any |/| < V, we have £(/) = TVt(f) P^-a.s. Clearly, the convergence rate of — > it is 
dominated by the rate C2\\Cn — £\\v-> which depends on the convergence rate of the exploring chain. 

In light of the discussion in Atchade and Rosenthal (2005), Theorem 3.2 implies that the sampling 
chain in Algorithm 2.2 converges (to 7r^„) at rate 

(31) B(n, p) = mini< fe < n (ciA;/ (n-k) + c 2 p k ). 
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Take derivative with respect to k and set to to get 

n 1 k 

(32) c\- — ^=c 2 log-p. 

[n — ky p 

If we assume k = 0(n), then (32) reduce to - ph cp k , solve to get k = 0(log (n)), and we reach 
contradiction. So we may assume k = o(n), and (32) simplifies to « c 2 log - p k . Take log on 
both sides and solve to get k « — log n/ log p. Substitute back to (31) and use log (1 — x) rj — x 
when x small and note p k rs 1/n we get 

(33) B (n,p)«0(!L^). 

1-p 

Loosely, (1 — p) Gap(P^), where is the limiting transition kernel of the sampling chain. From 

(33) the sampling chain of Algorithm 2.2 converges at a polynomial rate that is also proportional to 
1 / Gap(Pg). Note the rate is not a function of the size of the n(x). 

Extending to Algorithm 2.3, the above analysis suggests that the sampling chains converge at rate 
0(r Xd n _1 logn) for some A € [1, 2] (Theorem 2.2), where r > 1, is the ratio between adjacent 
temperatures. A large r slows down the convergence of the sampling chains. However, the spectral 
gap of the exploring chain is enlarged by a factor of r Hd (Theorem 2.1). Hard problems will benefit 
from such a trade-off because their exploring chains usually have very small spectral gaps. 

4. Spectral Gaps. Our main aim in this section is to prove Theorems 2.1 and 2.2. We rely on 
the state decomposition theorem (Theorem 1.1) to analyze the Algorithm 2.2: we partition a multi- 
modal distribution into pieces log-concave pieces and analyze each restricted local chain P Ai and 
the component chain P c separately. The Gap(P ( 4 i ) is well studied (Lovasz and Vempala, 2003; 
Mathe and Novak, 2007; Rudolf, 2009; Guan and Krone, 2007) using the isoperimetric inequality 
(Lovasz and Simonovits, 1993; Kannan et ai, 1995), and Gap(P c ) can be computed directly using 
conductance and the Cheeger's inequality. 

4.1. Conductance and spectral gap. Recall P(x, dy) defined in (1), for A G B with tt(A) > 0, 
define 

(34) h P (A) = -}— [ P(x,A c )7r(dx). 

ntA) J A 

The quantity hp (A) can be thought of as the (probability) flow out of the set A in one step when 
the Markov chain is at stationarity. The conductance of the chain is defined by 

(35) hp = inf h P (A). 

0<tt(A)<1/2 

Intuitively, a small hp implies mixing slowly because the chain may be trapped in a set whose 
measure is < 1/2. On the other hand, a large hp implies mixing rapidly as nowhere is sticky. We 
have the following theorem (Lawler and Sokal, 1988). 
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THEOREM 4.1 (Cheeger's Inequality). Let P be a reversible Markov transition kernel with in- 
variant measure %. Then 

(36) *J < Gap(P) < 2 hp. 

Let k(x, dy) = (1 — s)ki(x, dy) + s k2(x, dy), with < s < 1. Suppose operators P, Pi, and 
P2 are induced by k, ki, respectively. Clearly, 

(37) P = (l-s)Pi + sP 2 . 

It is straightforward to show the following Lemma (Guan and Krone, 2007), which allows one to 
bound the spectral gap for a mixture of kernels. 

Lemma 4.2. Suppose P is defined by (37). Then 

(38) Gap(P) > ^m a x((l- S ) 2 h Pi , S 2 h P2 ); 
The following theorem was proved in Guan and Krone (2007). 

THEOREM 4.3. Suppose ir(x) is an a-smooth log-concave probability density of ' d- dimension 
on a convex set K. Suppose further that ir has barycenter and set M n = J K \x\ n(dx). Then the 
conductance, hp, of the Metropolis-Hastings chain with transition kernel P(x, dy) induced by the 
uniform 5-ball proposal satisfies 

5e~ aS 

(39) hp > 



1024 sfd M w ' 

Combining Equations (38) and (39) we obtain a lower bound of the spectral gap of a local chain 
sampling a log-concave distribution. Set 5 = l/a to see that it is fast-mixing. This and the state 
decomposition theorem leads to a conclusion that a Small- World chain is fast mixing (Guan and 
Krone, 2007). Moreover, because the dimensionality d is in a polynomial form in (39), so there is 
no "curse of dimensionality" with a local chain sampling a log-concave distribution. On the contrary, 
there is a "curse of dimensionality" in the component chain simply because "volumes" of modes 
matter (Guan and Krone, 2007). This justifies our focus on the component chain in Theorems 2.1 
and 2.2. 

4.2. Proof of The Thorems 2.1 and 2.2. We need to establish the connection between a log- 
concave distribution and its powered-up alternatives. 

LEMMA 4.4. Let f(x) be a log-concave distribution on M. d of dimension d, and f(x) attain its 
unique maximum at 0. Let ft(x) oc f[x) 1 ^, where t > 1. Then (a) > t~ d , (b) > yj^j 
for any x £ M d , and (c) if f(x) is a-smooth, then the equality in (a) and (b) holds up to a constant. 
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PROOF. By definition of log-concave, for < s < 1, 

f( sx + ^-s)y)>f(xyf( y ) 1 - s . 
Let s = 1/t and y = 0. Choose h, such that l/t + 1/h = 1, then 
(40) f(x/t) > /(x) 1 /* / (0) 1 / ft . 

Since f Rd f(x)dx = 1, we can get f Rd f(x/t)dx = t d J Rd f(x/t)d(x/t) = t d . So 

/(x) 1 /*/(0) 1/ ^x = /(O) 1 ^ / /(x) 1 /*^ < t d , 



rearrange term to get: j ^/(Jy/tdx — * rf ' Identify the left hand side is /t(0) to prove (a). 

Since /(») is log-concave, for any unit vector u G M d , f{uz) is monotone decreasing in z, hence 
f(zu) 1 ^^ 1 is monotone increasing in z (as we assume i > 1). This proves (b). 

Because of logconcavity, for any unit vector u G Mr, there exists scalars g, v, such that f(uz) < 
e -vz f or z > ^. By a-smooth, we have f(uz) > e~ az . For exponential functions, equality in (40) 
holds and this proves (c). □ 

REMARK 3. Note the bound in (a) is essentially tight for a general log-concave distribution, as 
can be seen clearly from one-dimensional exponential distributions. For distributions such as ( multi- 
variate) normal, the bound in (a) is not tight. However, we are confined by the technical condition of 
a-smoothness, and we do not pursue a better bound. Although the a-smooth is a technical condition 
for the convenience to bound conductance (Guan and Krone, 2007), it is worthwhile to note that it 
is crucial for the tempering scheme as well. Taking an extreme example, the tempering will not help 
for two uniform distributions on two unit discs that are far apart. While tempering is helpful for two 
normal distributions that are far apart, as we will see in Section 5. 

Lemma 4.5. For a piece-wise a-smooth log-concave distributions defined in (8), let £i(x) = 
w 1 / t -Ki(x) 1 ^ t I U, where Ii = w l J l J A _ r Ki{x) x l t dx for i = 1, . . . , m. Let I = h/m and d( x ) = 



IV 



iTiix) 1 / 1 1 1 for each i. Then there exists constants c\,C2 such that c\ < < C2- 



Proof. For any pair w-i > Wj, for t > 1, we have Wj/wi < w^ /w^ < Wi/wj. So that 
without loss of generality, we may assume W{ = 1 for i = 1, . . . , m. By (c) of Lemma 4.4, we have 
ii(Ai) = CiiTi(Ai)t~ d / Li. Since ii{Ai) = iii(Ai) = 1, we have Ij = C{t~ d , which implies that 
£,'i(Ai) I ^i(Ai) = mci/^2ci and the conclusion follows. □ 

REMARK 4. Lemma 4.5 says two different normalizations, namely, normalization within each 
pieces and normalization combining all pieces, are equivalent up to a constant for a piece-wise 
a-smooth log-concave distribution. So we can use the normalization within each pieces to ease the 
presentation. 
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With Lemma 4.4 at hand, the proof of Theorem 2.2 is easy. 

Proof of Theorem 2.2. Consider tt Ai , ^a 2 as the ir restricted on A\ and A 2 respectively, 
which we denote by ni and tt 2 . Let £1, £2 be their normalized powered-up alternatives respectively. 
The conductance bound of P c requires estimate of the integral appeared in (10) 

(42) = f AlxA2 min ^)^(x), 2 {y)dxdy 

(43) =c 1 ^ 1 {A 1 )k 2 {A 2 ), 

where the last equality obtained from Lemma 4.4. Therefore, from Equation (35) we get for some 
constant c 

(44) hu = — ^ / min fi. ^rn\i(x)b(y)dx = 

2vri(Ai) J AlxA2 V 7T1W6(2/)/ * d 

Hence, the conductance of the component chain P c is proportional to t~ d . Following the Cheeger's 
Inequality (36) to get the bound on spectral gap. It is clear that the bound is not a function of the 

"size" of 7r. □ 

Proof of Theorem 2.1. Use same notations defined in the proof of Theorem 2.2. Let h(x, y) 
be the long range proposal. We have 

u 1 I ■ ( \ 71-2 (y) h (y, x )\ 1 \ t / s , , 

hi2 = ~ — Ta~\ 1 111111 [i'—miTf \ Ux{x)h(x,y)dxdy 

2tti(^i) JaxxA 2 V ^{x) h(x,y) J 

1 f . fh{y,x) h(x,y)\ 

/ m111 r^T' rT~ )ni(x)n 2 {y)dxdy 

Ja 1 xA 2 V 7T2 (y) J 



(45) 



> ai- — y-v-T ( TTiixjir^dxdy 
2ni(Ax) J Al xA 2 

= -aiit 2 (A 2 ) 



where 



/ h(y,x) h(x,y) \ 
V 7Ti(x) ' ir 2 (y) J' 



(46) ai = inf min 

x£Ai,y&A 2 

Follow same argument to get for powered up distributions &'s. 

^12 = 9C / , \ / min( 1, ^ 2 \ k }?, )£i(x)h(y)dxdy 

(47) 2^1(^1) y AlXj42 v M%)h{y)J 

> ^a t ^ 2 (A 2 ) 
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where 

(48) a t = mf Mnf^,^ 



Note 7t 2 (j4 2 ) = £2(^.2), so the ratio /i' 12 //ii2 is determined by a t /a\. However, for each x G 

?l(cc) 7ri(a;) 

(49) /i' 12 = c /i 12 t d . 



Ai, y G A 2 , = c t d due to Lemma 4.4 (c). So we have at/a\ = ct d and hence 



For an m x m stochastic matrix P c = (hij), the spectral gap can be bounded from below (Pena, 
2005) by 

(50) Gap(P c ) > m min^-/^-. 

Combine Equations (49) and (50) to finish the proof. □ 

5. Applications. There are three key parameters needed to be specified in applications, namely, 
the proportional of the long-range proposals s, the number of chains (H + 1) and the temperature 
ratio t. The theoretically best value of s is 1/3 because it maximizes the lower bound of the spectral 
gap of a Small-World chain (Guan and Krone, 2007). Indeed, in the following two examples, we use 
s = 0.33. Of course for a specific application one may tune s based on acceptance ratio. We note, 
however, s should keep constant during a MCMC run. To specify H and r we suggest the following 
procedure: First, sample nt(x) and tune t until acceptance ratio of long-range proposal is high, say, 
larger than 0.2. Next, use Algorithm 2.2 to sample 7Tt(x), ir t / T {x), tuning r so that the acceptance 
ratio for long-range proposal of the sampling chain is > 0.2. H can be estimated by [logi/ logr]. 
One may find burn-in and thinning helpful. 

Lastly, the choice of long-range proposal is often problem-dependent. If the state space is M. d , 
we recommend a heavy tailed distribution like Cauchy. When the state space is trees or graphs, a 
long-range proposal is hard to define - we suggest to compound local proposals of randomly many 
times to obtain a long-range proposal. 

5.1. Sampling Needles. In this example, our target distribution is a mixture of normals, f(x) = 
0.5N(x; hi, Si) + 0.5iV(x; m, S 2 ) where Ml = (0, 0)*, fi 2 = (5, 5)* and £1 = E 2 = ( ° 01 ° 01 ). 
To see the minorization and drift condition hold for this example, see Atchade (2010) and references 
therein. 

Our local proposal is two dimensional ball with radii 0.1. The long-range proposal is two dimen- 
sional Cauchy with scale parameter 1. Local chains are trapped in either mode within 1, 000, 000 
iterations (data not shown). We use 6 chains with r = 6 with 1000 burn-in steps in each chain. The 
sampling chain ran 10, 000 steps, which makes the total iterations of the all 6 chains 81, 000 steps. 
Define region A = {x : x\ + x\ < 0.05 2 } andp = Pr(Xn^ 6 A), the probability that the sampling 
chain visits A. Figure 5. 1 is the sample trace of a typical run, where after thinning of every 10 steps, 
the last 1 , 000 samples of each chain were plotted. 
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FIG 1. Each color corresponds to different temperature 6 k , k — 5, 4, . . . , 1, 0. The proportion of the sampling chain that 
visit region A is 44.6% 

We repeat above run 100 times and obtain statistics regarding p. The mean is 0.50 and median is 
0.49, the standard deviation is 0.08, and the 5 and 95 percentiles are 0.37 and 0.62 respectively. We 
repeat simulations for //2 = (25, 25)* without modifying l(x, •) and h(x, •). With an extra chain and 
twice number of iterations, similar results are obtained (data not shown). 

5.2. Sampling Phylogenetic Trees. Markov chain Monte Carlo algorithms plays an important 
role in (Bayesian) phylogenetic inference, perhaps through the wide-spread usage of software pack- 
ages like MrBayes (Ronquist and Huelsenbeck, 2003) and PAML (Yang, 1997). Mossel and Vigoda 
(2005) argued that phylogenetic MCMC algorithms are misleading when data (nucleotides se- 
quences) are generated by mixture of phylogenetic trees. Fixing branch lengths, they generated 
sequence data using two trees that are far apart (that is, local proposals can not reach from one to 
another in one step). They first showed that there is a valley in between the two trees used to simu- 
lated sequence data, and the valley becomes steeper when the sequence length (N) increases. Then 
they argued that existing local samplers takes exponentially long iterations (in N) to move from one 
mode to another. Their theoretical results is essentially the first part of the Theorem 3.1 in Guan 
and Krone (2007). In light of Guan and Krone (2007) and theories presented in this paper, we see 
that the slow mixing problem presented in Mossel and Vigoda (2005) can be simply resolved by a 
Small-World sampler, and better mixing can be achieved by a STEEP sampler. 

In our simulation, we fix the branch lengths the same as those in (Mossel and Vigoda, 2005) with 
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inner branch lengths equal 0.1 and tip branch lengths equal 0.01. We use Jukes-Canter as the evo- 
lutionary model to compute likelihood of different tree topologies. Our local proposal is the nearest 
neighbor interchange (NNI) (c.f. Felsenstein, 2004), and the long-range proposals are simulated 
by compounding multiple (but random many) nearest neighbor interchanges. In this example, the 
minorization and drift condition hold because the state space is finite. 

The five taxa example presented in (Mossel and Vigoda, 2005) is too simple for a simulation 
study. We simulate DNA sequence data (Rambaut and Grass, 1997) based on two generating trees 
(in Newick format) A = (((((1, 2), 3), 4), 5), 6), (7, 8)) and B = (((((1, 7), 3), 4), 5), 6), (2, 8)). 
Note we switch the position of taxa 2 and 7 and it takes 5 NNI to move from tree A to tree B. We 
found it difficult to simulate sequence data from the two trees that results in similar likelihood on 
both, so we simulate sequences of length 1000 from tree A and switch the sequences 2 and 7 to 
obtain new sequences and concatenate two sets of sequences together. By doing so, we ensure that 
tree A and B have the same likelihood. 

We first ran a simple Metropolis-Hastings algorithm and plot the distance between taxa 1 and 2, 
the local chain were trapped within one mode during 1, 000, 000 steps (data not shown). We then ran 
STEEP sampler of 4 chains with r = 10, each chain ran 50, 000 steps with 5, 000 steps of burn-in. 
After thinning every 10 steps, the last 5000 were plotted for each chain. The left panel shows the 
likelihood trace plot of each chain. The right panel shows the distance (between taxa 1 and 2) trace 
plot. Clearly, the chain moves frequently between trees A and B. 




i 



FIG 2. Each color band corresponds to a different chain of temperature 10 fe , k = 3,2, 1, 0. The best log-likelihood of 
trees that is one local proposal (NNI) away from the generating trees is 60.44 smaller than those of generating trees. 
Only two generating trees appear in the last chain and the proportion of the two generating trees are 44.0% and 56.0% 
respectively. 

This toy example demonstrates that the STEEP performs well in mixture of trees where a local 
chain fails. Note in the example we fixed the branch lengths and evolutionary parameters. We in- 
vite authors of MrBayes, PAML and others to further investigate its performance when taking into 
account of branch lengths and evolutionary models. 

6. Discussion. We have presented a new sampling algorithm, proved its ergodicity, and demon- 
strated its usefulness. The analysis of the spectral gap appears to be new. Although the theory is 
presented in the Euclidian state space, we believe it applies to more general state space as well. 
The STEEP algorithm bears similarities with the Equi-Energy sampler. One key difference is that 
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STEEP emphasizes the long-range proposal, through which the STEEP provides a new perspective 
on the advantage of using empirical distribution of tempering. We note that, at least in principle, the 
ergodic theorem and the analysis of the spectral gap apply to the Equi-Energy sampler. The STEEP 
also bears similarities with pure tempering methods such as MCMCMC. We point out three key 
differences. First, STEEP takes advantage of the empirical distributions, while the "swap" in tem- 
pering methods always use the current state of each chain. Second, the interaction of the different 
chains in STEEP is one-way - from high temperature chains to the lower ones. Since the higher 
temperature chains are not affected by the lower temperature chain, the exploring is more efficient. 
Third, tempering relies on local proposals to be more efficient on a. flattened distribution, while 
STEEP relies on long-range proposals to be more efficient on a. fattened distribution. 

The "Powering-up" is convenient to obtain flattened (or fattened) alternative distributions. How- 
ever, for certain type of models, e.g., Ising model and its generalization, Potts model, powering- 
up could run into problems because there might exists a phase transition at a critical temperature 
(Bhatnagar and Randall, 2004). When that happens, distributions of above and below the critical 
temperature may have little similarity; thus it becomes a moot point to use empirical distribution 
of one temperature to generate long-range proposals for another. To circumvent the "phase transi- 
tion" one should discard powering-up scheme. Instead, one may use the "multi-set sampler" (Leman 
et al, 2009) to smooth a distribution to achieve a similar effect as tempering. A multi-set sampler 
augments the state space from M. d to M dxm so that the current state is a vector (x\, ■ ■ ■ x m ). Define 
7r'(xi, • • • , x m ) = — (tt(xi) + • • • + ir(x m )). This averaging effectively gives a smoother marginal 
distribution n'^-) compared to 7r(-). And we can control the smoothness by varying m. 

In this paper, we focus on the mixing between different modes because within each mode local 
proposals guarantee rapidly mixing. However, when there exists a mode that is highly correlated 
among certain dimensions, one needs to fine tune the local proposals. Incorporating certain adaptive 
sampling scheme such as Craiu et al. (2009) for local proposal into STEEP might be desirable. 
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